DuckDB: The SQLite for Analytics (Mark Raasveldt, CWI)

Created with glancer

So we are super excited to have Mark Raasveldt from CWI come give a talk for us today. Mark did his PhD at CWI and is now halfway between being a postdoc and being a PhD student, because he hasn't defended yet. But you are officially a postdoc? Are they paying you like a postdoc? They're paying me almost like a postdoc. So you're in between. I'm a pre-doc, it's kind of nice, you know, a pre-postdoc. Okay, so DuckDB is, in my opinion, probably the most exciting academic database system to come out in the last two years, and as the title says, it's SQLite for analytics, so I don't think there's anything else to say; that sums it up, DuckDB is awesome, so go for it. And the way we'll do this is people can just unmute themselves and ask Mark questions as we go along. Yeah, no problem; at any point just ask. Don't literally shout, it is a bit late here, so I guess my neighbors wouldn't like that, but just ask, that's fine. Alright, so I guess I don't need to introduce myself, I could have skipped
this slide. Thank you for the introduction and the invite, and I'm very happy to be here as well. So yeah, me and Hannes: in this picture you can see Hannes and me down there drinking beer at almost-Oktoberfest. We made DuckDB, the database system, and that's what I'm here to present today. Let me just interrupt, go back one slide. For everyone: that guy in the middle, that's Peter Boncz, the inventor of Vectorwise, another CWI guy who has cranked out amazing systems. Yes, so Peter is not super involved in actually building DuckDB in terms of the code base, but we definitely ask him for advice quite a lot. He is way too busy with tons of other things; he has no real time to code that much anymore, but we definitely talk to him a lot. I mean, we used to share an office before all this mess. Alright, so basically this presentation has two parts. The first part is a talk I already gave at CIDR, because, well, it's too late to prepare anything new, and then I have a bonus part, lessons learned building a database system, which I think is interesting, or could be interesting, to people that might be building a system right now, like Andy. Alright, so before
I talk about database systems, I first want to give a bit of motivation for why we built DuckDB in the first place, and that motivation started with data science. Well, what is data science? I just call it statistics on a MacBook; there are a thousand different words for it, it's a very overloaded term. But one thing that's obvious about data scientists is that they work with data, it's in the name, right? And then we have us, the database researchers. In our field we've been working for more than 50 years, I think, at perfecting database systems. So obviously, data scientists, who work with data, must be using them, right? Well, unfortunately not. They use them if they are forced to, such as when they have to take data out of a database system in the first place, but they generally prefer to avoid them, which is a bit sad; it makes the database people sad. So here's a quote from Hadley Wickham; I can't actually read it because I blocked my quote with this overlay: "If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating." If you don't know Hadley Wickham, he's kind of the R guy; he made basically all the popular R packages, like ggplot2 and dplyr, a huge guy in data science. And he is saying this about, in a lot of cases, our life's work, so it's a bit sad. So why is he saying that? Well, maybe they just don't need database technology, right? So what do data scientists actually do? Probably the easiest way to figure that out is to look at the packages they use. If you look at popular data science frameworks in Python, the top two are NumPy and pandas, which is pure data wrangling, and in R it's very similar: the top two are dplyr and data.table, all data wrangling libraries. And what do they all do? Well, they do database operations. Sometimes a bit more, but some of them do just database operations, or even less than that: they do joins, aggregates, basic transformations, filters. So it turns out that data scientists do need all this functionality that we are building in our database systems, but somehow they're not using them. So let's go back to this quote: why
does he think that database systems are slow and frustrating? I mean, we've been working on optimizing database systems for what, 50 years or more? Not me personally, but our field. So to figure it out, probably the easiest way is to just run a data science workflow with and without a database system and see what happens. A typical data science workflow is kind of simple: you load some data from a CSV file, you do some ETL, so pre-processing, then you run your analysis; a really simple sort of workflow. For the example we ran here, we have some voter data from North Carolina. It's just a data set we had lying around, which is around 300 megabytes in CSV format. Then we do some, in this case, very nonsensical pre-processing; oh no, this part is not nonsensical: we remove entries with a missing phone number or house number, and then we compute the correlation between phone number and house number, that is the nonsense part. I couldn't read it again because this thing is blocking my view. Of course, the correlation between phone number and house number is expected to be zero. So let's do this in R. First we install some required packages, say data.table and dplyr, some basic wrangling packages. Then we run our analysis: first read the CSV file, then apply the filter, and then compute the correlation. It's three steps, three lines of code, and indeed we see no correlation between phone number and house number, no surprise. It's quite easy and convenient. So what happens if we do this with a database server? Well, for that let's just use Postgres, because it's kind of the open-source database. First we have to install the database server, and this is not actually trivial without a package manager, and for us, at least on our cluster, we don't have root access.
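The three-step workflow just described (read the CSV, drop rows with missing values, compute the correlation) can be sketched in plain Python as well. This is a stdlib-only illustration, not the actual script from the talk; the `phone_number` and `house_number` column names and the file path are hypothetical stand-ins for the North Carolina voter data.

```python
import csv
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no numpy needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlate_rows(rows):
    """Steps 2 and 3: filter out rows with a missing phone or house
    number, then correlate the two columns."""
    kept = [r for r in rows if r["phone_number"] and r["house_number"]]
    phones = [float(r["phone_number"]) for r in kept]
    houses = [float(r["house_number"]) for r in kept]
    return pearson(phones, houses)

def workflow(path):
    """Step 1: load the CSV, then hand the rows to the analysis."""
    with open(path, newline="") as f:
        return correlate_rows(list(csv.DictReader(f)))
```

Calling `workflow("voters.csv")` would reproduce the load, filter, and correlate steps in one place, with no server involved.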
So we have to download the sources and compile them ourselves. After we do that, we have to set up some cryptic environment variables, initialize the database, and use this very descriptive pg_ctl command; when this was created they had to save letters on the command names. Then we install the client; at least this is easy, right? We just install it through the R package manager, except it doesn't work, because the client actually requires you to have the database system itself on your path for some reason. Okay, fine, we put the server on the path, install the client again, and then we run. Now we connect to our database system, providing the host, the database name, the client, whatever. We create a table, and I'm actually being nice here, because you see these dots: this is around a hundred different columns, because it's a very wide data set, so it's a very big statement. Then we send a query where we use string concatenation to construct a COPY command to read the CSV file, then we can run a DELETE as the filter, read the table, and compute the correlation. So you can already see, all of these steps are pretty annoying: the installation is more difficult, the usage is more difficult, there are a lot of pains here. What about the time? Well, it's a lot slower as well, and the whole reason it's slower is actually this client-server connection, where you are basically pushing this whole table, which is around 300 megabytes, over a socket connection, and that just costs you like 10 extra seconds here. So it's annoying and slow. In the end, Hadley is right: for these types of workflows in R and Python, database systems are frustrating, and they are slow. So what are the problems here, and how can we fix them?
Well, the problems we have identified are that they are difficult to install, difficult to set up and maintain, they have really slow data transfer to and from the client, and they have a poor interface with the client application. Of course, the client-server architecture is nice for multiple users, but for a single user it's sort of pointless; it just makes everything more cumbersome. The installation needs to be easy: if you use Python, you need to be able to install it through pip install; anything else is already way too cumbersome, because then you're looking at OS-specific package managers and fiddling with compilers, and that's just not happening, it's too much. Transfer efficiency is very crucial, which we also have a paper about, in VLDB some years ago; actually, the Oktoberfest picture you saw before was taken at that VLDB. The database client protocols are not optimized for this kind of bulk transfer scenario, and this transfer is necessary, because data scientists can't run everything within the server. They need to use their machine learning framework of the week, or of the month, whatever, to run their data science; you're not going to do a deep neural network within SQL. So they need to transfer this data at some point. And finally, there's SQL. It's a lovely language, I very much like SQL, but as an interface for clients it's kind of problematic, because you have to embed SQL as a string, which is kind of unnatural: you don't get syntax highlighting in most editors, and it has introduced a whole class of security problems through this interface, which we call SQL injection. It's also not that easily composable. If you look at dplyr, for example, you can chain operations: you have a filter and a projection, you can chain them, and you can look at intermediates by just pushing a head() in between at any point. In SQL you can kind of do that with temporary tables, but you have to write a lot of cruft; there's an extra SELECT * FROM being pushed everywhere, a lot of cruft that you just don't need in this dplyr-like interface. So, well, the current database systems are not a good fit for data science. Why do we care? Well, the answer is simple: money.
Data science is very popular, so we can make money, we can get grants. But maybe a more fundamental reason is that data scientists are kind of reinventing database systems. They are re-implementing joins; they are recreating a database system and doing a very poor job of it. They're basically ignoring all the research that we in the database field have worked on in the past 50 years: they don't use optimizers, they materialize huge intermediates, they don't support out-of-core computation, they don't have any parallelism, they don't scale up. So they are kind of making a poor man's database, and it's winning: it gets a lot of users just because it has a good interface, a good connection to the Python community, and databases could solve all these problems. Oh, Mark, I wonder if I may ask a question, if you don't mind? Of course, yeah. By the way, my name is Ling, I'm a PhD student here.
Nice. So, the previous few slides, and also the earlier introduction, made me think: at the beginning of the talk you were talking about the most popular libraries in Python, which are NumPy and pandas. From my own perspective, using those libraries, I feel like the functionality and features of NumPy and pandas are a little bit different. In pandas you have lots of database-style operations, like filters, joins, transformations, etc., but in NumPy, much of the time you are using arithmetic or numeric operations, like matrix multiplication, which is a different thing, or broadcasting, or this very convenient indexing interface to quickly index a matrix at different rows and columns. So I think that part would be a very important reason for using NumPy. You definitely have the matrix stuff as well; that is an excellent point, and indeed that is different. But still, a lot of the stuff in NumPy is also present in database systems, and that's also kind of the reason why you need this convenient data bridge between the two: because once you start wanting to do things like matrix multiplication, for which the relational
model is maybe not so suitable, you can easily go back and forth between what each system is good at, the strengths of each system. Okay, I see, that makes sense, thank you. Thank you for your question. Alright, so where was I? Yeah, this was actually kind of the bridge between the two things: indeed, databases are slow and frustrating, but they don't need to be. Cue DuckDB, our database, which we call the SQLite for analytics. The reason we call it the SQLite for analytics is that SQLite is really the most easy-to-use database, if you have ever used it, and we really, really like it, but it has one big flaw: it's quite slow once you have to do any sort of analytics. So the core features of DuckDB are actually very similar to SQLite, with a few big exceptions. There's very simple installation, which I will go into in a bit more depth; it's embedded, so you don't have any external server management; we have a single-file storage format, just like SQLite; we have fast analytical processing; and we have very fast transfer to R and Python, together with a convenient API. So the first question you always get is: why DuckDB? Well, ducks are amazing animals: they fly, they walk, they swim, they're resilient, they can live
off anything. And Hannes used to own a pet duck called Wilbur. Here's a picture of Hannes, my former supervisor, now a co-worker, still kind of a supervisor, with his duck on the boat where he lives, because he's a bit crazy, but the good kind. So he used to own a duck; where's the duck now? Well, the duck found greener pastures: he flew away to form a little duck family, we assume. Wait, it was a pet duck at his boathouse and then it just wandered off? Well, the duck had complete freedom; it wasn't in a cage, it would just swim in the canals, fly around, fly back, and sometimes it would be gone for a day or two but usually come back at some point. One day it just didn't come back, so we assume it met a lady duck and made a beautiful family. Did the duck disappear before you guys started DuckDB? Yeah, yeah. That's a great backstory for the name. It was a very sweet duck, I watched it grow up, so I'm still missing Wilbur. Alright, so, DuckDB is a very young project. I made the first commit one year and
nine months ago, on the 13th of July 2018. As you can see, it's a very bare-bones commit; I just added a parser. The initial system was actually heavily inspired by Peloton: when I started making DuckDB, I just looked at the parser and the binder from Peloton and took about half of it. Here you can see a code snippet from the early DuckDB, and if you are familiar with the Peloton code base, maybe it looks a lot like it, because it is: many names are still present, we have a DataTable, BindContext, StringUtil. By now it's mostly been rewritten, besides some small things; the StringUtil methods are still there, and so are the same exception types. So here are the internals at a glance. It's a columnar database, surprise, it's for analytics. It uses vectorized processing; that is probably a surprise to some people who prefer JIT compilation. There are a lot of reasons why we chose vectorized, and we had a lot of debates about this, but the main reason we eventually decided on it was that we wanted to avoid the compiler dependency, which is very problematic if you want users to be able to install the system easily. But there are a lot of good sides to vectorized processing; still, when you compare it with JIT, they both have their advantages. We use an ART index, and we use MVCC that is inspired by the
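To illustrate the vectorized model mentioned above, as opposed to tuple-at-a-time interpretation or JIT compilation: operators pass fixed-size chunks of a column between them, so interpretation overhead is paid once per vector rather than once per row. A toy Python sketch follows; the chunk size is arbitrary here, and DuckDB's real vectors are typed C++ arrays, not Python lists.

```python
VECTOR_SIZE = 1024  # illustrative; real engines pick a fixed, cache-friendly size

def scan(column):
    """Emit a column as a sequence of fixed-size vectors (chunks)."""
    for i in range(0, len(column), VECTOR_SIZE):
        yield column[i:i + VECTOR_SIZE]

def vectorized_filter(vectors, predicate):
    """Apply a predicate one vector at a time: the operator is invoked
    once per chunk, not once per tuple."""
    for vec in vectors:
        yield [v for v in vec if predicate(v)]

def vectorized_sum(vectors):
    """Aggregate over the stream of chunks."""
    return sum(sum(vec) for vec in vectors)

# Sum the even numbers in 0..9999, flowing through scan -> filter -> aggregate.
total = vectorized_sum(
    vectorized_filter(scan(list(range(10000))), lambda v: v % 2 == 0))
```

Each operator sees whole chunks, which is what lets a real engine use tight loops over typed arrays inside each operator call.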
one from HyPer, but it has been adapted, kind of optimized, for a vectorized engine. And we use a single-file storage format, which looks a bit like SQLite's. It's all C++, and we use the Postgres parser, with a wrapper heavily inspired by the one Peloton built around that parser. Can you talk about the changes you made to the HyPer MVCC to deal with a vectorized engine? Well, it wasn't so much to deal with a vectorized engine as to deal with bulk updates and columnar updates. I'm actually still looking to write a paper about this, but I'm kind of procrastinating. The thing is, the HyPer MVCC has one big pitfall, which is that it always copies entire rows to the undo buffers; I don't know how familiar you are with this. We've done the same thing in our implementation. Okay, so it has this one big pitfall: it does this copying of entire rows into undo buffers, which for insertions and deletions is kind of fine, but for updates gets pretty messy, especially when you have to deal with large columnar updates. Because if you run an update that touches one column, but say ten percent of that column, you are copying ten percent of your table into an undo buffer, and it might be one column out of a hundred columns. We figured that was madness, so we fixed that. Right, we don't do that either, is that correct? Yeah. And then the other issue with the HyPer MVCC is the single writer; I think we got around the single writer. They only have one writer at a time, or some aspect of it like that. We don't really have this: we have parallel writes to different tables, and also to the same table, kind of, with some switching around on some latches here and there. But one thing we do have is one committer; maybe that's where we're lighter, one committer holding the write-ahead-log lock. We don't have that either. I'm not sure about this, but we are also not really that concerned with very small point updates or point writes; we are mostly concerned about bulk operations, and that's why we made these changes to the MVCC, to deal with bulk updates specifically. So we have a different versioning scheme: our versions are actually vectors, not individual rows, also to deal with this. I will write a paper about this at some point; I also have a bunch of slides I can send you if you want to take a look. Yes, we should. The student that wrote the MVCC stuff for us is not here; he's now a PhD student at MIT, but we should swap notes at some point. Yeah, for sure, that sounds excellent. I spent quite some time on the MVCC, first on implementing it in the first place and then on revamping it.
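The bulk-update idea discussed here, recording only the touched column chunk in the undo log instead of copying whole rows, can be sketched with a toy structure. This is an illustration of the concept only, not DuckDB's actual MVCC data structures; the class and method names are made up.

```python
class ColumnVersionedTable:
    """Toy undo log in the spirit of column-chunk versioning: an update on
    one column of a row range records only that column's old values, not
    entire rows, so a bulk update on 1 of 100 columns stays cheap."""

    def __init__(self, columns):
        self.columns = columns   # column name -> list of values
        self.undo_log = []       # entries of (column, start_row, old_values)

    def bulk_update(self, column, start, new_values):
        # Save only the overwritten slice of this one column.
        old = self.columns[column][start:start + len(new_values)]
        self.undo_log.append((column, start, old))
        self.columns[column][start:start + len(new_values)] = new_values

    def rollback(self):
        # Undo entries in reverse order to restore the original values.
        for column, start, old in reversed(self.undo_log):
            self.columns[column][start:start + len(old)] = old
        self.undo_log.clear()

table = ColumnVersionedTable({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
table.bulk_update("a", 1, [99, 98])  # undo log holds only column "a", rows 1-2
```

A row-based undo log would have copied the untouched column "b" values too; here the log size is proportional to the updated column slice alone.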
Definitely, that sounds excellent. Alright; oh no, this slide actually got copied here again, I accidentally duplicated it. So, let's go back to the database problems we had before and how DuckDB addresses them. We had: difficult to install, difficult to set up, slow data transfer, and a poor interface. All
right, so first, installation and usage. Here we actually learned a lot of lessons from SQLite. First off, it's an embedded system with zero external dependencies; all the dependencies are completely inlined. We have an amalgamation build of the system, which means we can compile all the source files you need to build the whole system into two files, a header and a source file. It takes some time to compile, like a few minutes, but it does work. We have also integrated DuckDB entirely within the package managers, so you can just type pip install duckdb and it will fetch the latest version. Does it actually compile when you pip install, or what does that do? Well, it depends on the system: I think on Linux it compiles, and on OS X and Windows it fetches a binary, so it just gets a DLL and dumps it there. From a Python background: you can actually get wheels, so, broadly distributed pre-built binaries. Yeah, we do that too, to speed up install time; we use wheels as well. Someone else set that part up, so I don't have that much insight into the wheels, but we do use them, yes. For the ones like Windows, that environment we keep pretty consistent, I guess, and OS X as well, but what happens if they have some
funky build, you know, they're using the North Korean version of Linux, and they try to install this and it doesn't compile: do you get an error, do you get a message, or does it fail silently? Usually, the way we set up Python for this type of C++ dependency code is that it will fall back on compiling: if it can't find a wheel, it will look for the system compiler and compile from source. I guess the question is: if it falls back to compilation, and you're running on a weird flavor of Linux and compilation fails, does DuckDB get notified that it failed, or nothing? Well, people can open an issue; we don't have any sort of reporting back or any of that. We don't believe in that either; no phoning home, not invasively, let's say. Of course. Let's
keep going, we're fine on time. Alright, sure. So, one more thing on simple installation: DuckDB does not rely on any external state, meaning separate config files or environment variables or locale things, which can otherwise be pretty messy. And we have the single-file format, which users typically find easier to handle; see, for example, Microsoft Office. Then we also have a composable interface, which is still kind of a work in progress. This interface directly connects you to the DuckDB system without having to type SQL queries; you can still type SQL queries, we support both, but the idea is that you can compose these sorts of operations. So if you want, say, a table scan followed by a filter followed by an aggregate, you can write something like a table call with the name of the table you want, which gives you a table scan; then you build a filter on top of that with an expression, and you can inspect any intermediate there; then you can run an aggregate on that. All of this is relation in, relation out, and it's a very thin mapper over the logical operator tree we have internally. One nice part about being embedded is that we can integrate very easily with things from the host environment. So we have a scan of a pandas data frame, and we can also produce a pandas data frame as a result, and the scan of the pandas data frame is basically zero-copy, except for strings, which are nasty. It will just run a scan of a pandas data frame, go through the entire DuckDB query engine, and on the other end the result can be a pandas data frame again, and you benefit from the vectorization and from our entire engine; you can basically run whatever you want on a pandas data frame. We can also run plain SQL on
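The composable, relation-in/relation-out style described above can be sketched with a toy class; the names here (`Relation`, `filter`, `project`, `aggregate`, `head`) are illustrative, not DuckDB's actual client API.

```python
class Relation:
    """Toy relation: every operation returns a new Relation, so operations
    chain dplyr-style and any intermediate can be inspected."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # relation in, relation out
        return Relation(r for r in self.rows if predicate(r))

    def project(self, *cols):
        return Relation({c: r[c] for c in cols} for r in self.rows)

    def aggregate(self, func):
        # terminal step: collapse the relation to a value
        return func(self.rows)

    def head(self, n=5):
        # peek at an intermediate result at any point in the chain
        return self.rows[:n]

people = Relation([{"name": "a", "age": 30},
                   {"name": "b", "age": 17},
                   {"name": "c", "age": 45}])

# table scan -> filter -> aggregate, as one chain
adult_count = people.filter(lambda r: r["age"] >= 18).aggregate(len)
```

In a real engine each of these calls would just build up a logical operator tree; nothing executes until a result is actually demanded.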
a pandas data frame, no problem, which is quite nice. Yes, sorry for the interruption: I just wonder, how can you do a zero-copy scan of a data frame? Is your storage format compatible with the data frame format, or do you do some transformation? So, it's not about the storage format, it's about our scan operator, which emits vectors, and these vectors can point to an array of data. For integers and doubles and such it's just an array of integers or doubles or what have you, and pandas has the same arrays internally. Because we also run inside the process of the user, we can just grab that array and run on it in our system. For strings it's a bit more complicated, because for strings they have Python string objects, which is a bit more of a mess, but for numbers you can do it without any copying: just set some pointers and go. I think Ling's question is whether DuckDB knows about the data frame format. Yeah, yes: we have a function, essentially a table-producing function, that has a pointer to a data frame and just scans that memory. So, according to Mark, the data frame format is just arrays, except for strings. Internally, a pandas data frame is just a bunch of NumPy arrays, right? Yeah, sure. And the NumPy arrays are just plain C arrays, right? Yeah.
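The zero-copy point can be illustrated with Python's buffer protocol: a consumer can operate directly on the memory behind a numeric array without copying it, which is the same spirit as DuckDB's vectors pointing into a data frame's NumPy arrays. Here the stdlib `array` module stands in for a NumPy array.

```python
from array import array

# A column of doubles, owned by the "client" (think: one numpy array
# backing a data frame column).
doubles = array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview is a pointer-like handle onto the same buffer; creating it
# copies nothing, which is the point of an in-process zero-copy scan.
view = memoryview(doubles)

# The "scan" operates directly on the client's buffer.
total = sum(view)

# Mutating the original is visible through the view: shared memory, no copy.
doubles[0] = 100.0
```

Strings would not work this way, since each Python string is a separate heap object rather than a flat numeric buffer, which is why the talk calls them nasty.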
So, as a conclusion of this part: DuckDB is pretty feature-complete. Not entirely; we still need parallelism, which I'm working on, and compressed storage, which I will work on after that. We run almost the full TPC-H benchmark; I have not implemented full outer join yet, which makes us fail two queries, but I will get to that as well. We have window functions, complex aggregates, and arbitrary subqueries; that's also very fun, by the way, arbitrary subqueries, I could give another talk about that. We have compressed execution, and, the one I can't read behind the overlay, string collation. So did you use ICU? We actually use utf8proc, which is a very lightweight alternative to ICU created by the developers of Julia, if you know it. It doesn't support everything ICU does, but it comes quite close: it supports string composition and decomposition, grapheme cluster detection, those sorts of things; for other things it's not as comprehensive. So that's what we use, because we really don't want to depend on ICU, and we don't actually support full string collation either, not all the locale stuff, not exactly, but we will probably add that as an extension at some point, because you don't want to link ICU just for that; I think that's a big waste, it's a big thing. So, DuckDB is free, open source, very much pre-release. We have a website and a GitHub; feel free to try it, or take parts of it if you're by chance
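The composition and decomposition operations mentioned for utf8proc can be illustrated with Python's stdlib `unicodedata` module, which covers Unicode normalization (though, like utf8proc, it is no substitute for full ICU collation):

```python
import unicodedata

# 'e' followed by a combining acute accent: two code points.
decomposed = "e\u0301"

# Composition (NFC) folds them into the single code point for 'é'.
composed = unicodedata.normalize("NFC", decomposed)

# Decomposition (NFD) goes the other way.
redecomposed = unicodedata.normalize("NFD", "\u00e9")

# The two forms render identically but compare unequal as raw strings,
# which is exactly why a database needs normalization before comparison.
raw_equal = (decomposed == composed)
```

A query engine that compares strings byte-for-byte without normalizing would treat these two spellings of the same text as different values.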
making a database system, and if you find anything that breaks, please send us a bug report. That was the end of my CIDR presentation, but I think I will continue a bit, because I still have some lessons learned that might be interesting for people who might coincidentally also be developing a database, potentially, allegedly. Alright, so just some quick notes on testing. What we have learned is that for a database system, it's probably best to always just test SQL. In a way, database developers are blessed, because we know the requirements of our system: it has to run SQL correctly, and the internals of the system are only useful insofar as they work towards that goal of running SQL. So testing internals is easy to do, but it's a bit of a trap; we have found it makes changing them later on very difficult, and you will need to change them if you want to refactor at some point. Some history here: we refactored, or rather rewrote, the vector implementation twice, once to add support for arbitrary vector buffers and once to add support for compressed execution. I reworked the binder like three times to get arbitrary subqueries to work. That's annoying. Yeah; if you want to take my, or our, binder, feel free: it does support arbitrary subqueries, sorry, correlated subqueries, and I'm pretty happy with the implementation we have now. There's still a bit I want to change, but it's kind of funny: binding is one of those things nobody talks about, but it's actually quite complex, a lot of things happen there. The planner was also rewritten a bit, and I think when we were rewriting all this, we were quite lucky that we did not have many internal tests, and
the ones we did have actually cost us some pain: we had a bunch of tests that tested the vector representations directly, and I had to fix those while I was refactoring, which was just annoying; I ended up removing a lot of them. So I think testing SQL is kind of the holy grail here: it makes it a lot easier to refactor your system later on. Another point: make testing fast. If testing is slow, you don't do it as often, and it's also developer time spent waiting. So we have split testing into a fast and a slow path. The fast tests just test functionality on small data sets, and they actually test a lot: the fast tests run in seven seconds in release mode, they trigger around 150,000 assertions, and I think they get 92 or 93 percent code coverage or thereabouts, but they test only small data sets, so they run very fast. Then we have a slow path that tests larger data sets and tests concurrency, where we have parallel updaters and parallel inserters, and where we test things like sending kill -9 to the database to see if it gets corrupted, those sorts of things. That's all on the slow path; those run as well, but when you are just developing or refactoring something, you just run the fast tests, and for us that's very convenient. What unit test framework are you using? Is it homegrown, or is it
what is it? We really like Catch, because it's, again, no external dependency, it's just one header. One annoying thing, though, is that because it's one header, it slows down compilation somewhat, which is why we now use unity builds: all the source files in the same folder are compiled together as one batch, because otherwise the compilation takes about twice as long. It's not super long, but it's kind of annoying if you're running on a MacBook from five years ago; a full compile is going to take at least three or four minutes, so it kind of knocks you out of the flow. Alright, so, verification code; I think that's also very important. In the fast tests we actually run every SELECT statement four times: once as the original query, obviously;
then we run the query with all optimizers turned off; then we run a copied query, because we have a Copy method for expressions, for select nodes, for everything you can copy, so we copy the query and run it again to verify that the copy works; and then we also serialize and deserialize the query and run it again to verify that the serialization and deserialization methods work. I think this is very nice, because copying and serialization are very hard to test for directly, but this basically makes it just work: if you forget to copy one element, your query is not going to return the same result, and you catch it right there.

When you say unoptimized, do you mean unoptimized C code, or unoptimized in the planner?

In the planner. We turn off the whole optimizer and run whatever comes straight out of the planner.

Got it.

We only do this for the fast tests, obviously, because this also means we don't run join order optimization, which means we usually just run a bunch of cross products. That is actually great for catching bugs in your join implementation, because now every join is run against a cross product plus filter, and if the results don't match up, you probably have a problem in your join.

Interesting.

This has caught tons of bugs for us; it has been really, really good for us.
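The four-pass verification can be sketched with a toy expression tree. Everything here, the `Expr` class, its methods, and the prefix-string serialization format, is an illustrative stand-in, not DuckDB's actual implementation:

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>

// Toy expression: either a constant or an addition of two children.
struct Expr {
    int value = 0;
    std::unique_ptr<Expr> left, right;  // null for constants

    int Eval() const {
        return left ? left->Eval() + right->Eval() : value;
    }
    // Deep copy: the kind of method the copy-verification pass exercises.
    std::unique_ptr<Expr> Copy() const {
        auto e = std::make_unique<Expr>();
        e->value = value;
        if (left) { e->left = left->Copy(); e->right = right->Copy(); }
        return e;
    }
    // (De)serialization to a prefix string, exercised by the roundtrip pass.
    void Serialize(std::ostream &out) const {
        if (left) { out << "+ "; left->Serialize(out); right->Serialize(out); }
        else      { out << "c " << value << ' '; }
    }
    static std::unique_ptr<Expr> Deserialize(std::istream &in) {
        std::string tok; in >> tok;
        auto e = std::make_unique<Expr>();
        if (tok == "+") { e->left = Deserialize(in); e->right = Deserialize(in); }
        else            { in >> e->value; }
        return e;
    }
};

// Run the "same" query several ways and demand identical answers,
// mirroring the original / copied / roundtripped passes described above.
bool VerifiedEval(const Expr &q, int &result) {
    result = q.Eval();
    int copied = q.Copy()->Eval();
    std::stringstream buf;
    q.Serialize(buf);
    int roundtripped = Expr::Deserialize(buf)->Eval();
    return copied == result && roundtripped == result;
}
```

A fast-test harness would route every SELECT through something like `VerifiedEval`; a field forgotten in `Copy` or `Serialize` then shows up immediately as a result mismatch.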
So the next advice I can give is: make an embedded database. Even if you're not making an embedded database, make one anyway, for testing. In the end, the wrapper around a database server is kind of a wrapper around an embedded database anyway, and an embedded database makes it much more convenient to write tests, and much faster, because you don't need to deal with all the client-server stuff. If you are sending individual SQL statements that run on one row or two rows, all of this overhead actually adds up, and it makes your tests run a lot slower, which again is annoying if you have to run them a lot. So we have gotten a lot of good out of having an embedded database. Before this I worked on MonetDB, which did not have that until we actually added it, and their testing framework takes around half a second for every single test they run, because they start up a database server, start up a client, connect to the database server, run the query, and check the results. As a result, they just run a lot fewer tests: their fast tests take something like 30 minutes to run, where ours take a few seconds, and we actually run more tests in that time. It really makes a difference in how much you can test, because there's always a limit.
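The in-process testing point can be illustrated with a toy embedded engine. The `EmbeddedDB` class and its `Query` method are hypothetical, not DuckDB's API; the point is only that with no server or connection, each tiny test statement costs essentially nothing:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-in for an embedded database: no server, no sockets.
// A test just constructs the object and calls Query() in-process.
class EmbeddedDB {
    std::vector<int> table;
public:
    void Insert(int v) { table.push_back(v); }
    // "Query" here is just a toy aggregate over the toy table.
    long Query(const std::string &what) const {
        long sum = 0;
        for (int v : table) sum += v;
        return what == "count" ? (long)table.size() : sum;
    }
};

// Because everything is in-process, thousands of tiny statements like
// these run with no connection setup or round-trip latency at all.
long RunTinyWorkload(int n) {
    EmbeddedDB db;
    for (int i = 1; i <= n; i++) db.Insert(i);
    return db.Query("sum");
}
```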
Then, automatic benchmarking is very useful. I know Andy has it, or wants to add it.

Work in progress, in the project.

Okay. It detects unexpected regressions, which is very useful. Here's something we actually found this morning, because one thing that we don't have yet, and also want, is automatic notification when something actually breaks. Here was a benchmark that had been running for a year that broke down in December, and we only found out now, because nobody was actually looking at it; it's just somewhere in our micro-benchmarks. But it actually found a real bug. You can see it by the commits: we have a list of commits, and if you go on our website you can see exactly which commit broke it. What actually happened here is that this BETWEEN expression broke the index optimizer, which made the query use a regular scan instead of an index scan. So, great, good catch.
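A minimal version of the regression check such a benchmarking setup needs might look like this; the 1.5x tolerance is an arbitrary illustrative threshold, not anything from the talk:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Flag a run as a regression if it is more than `tolerance` times slower
// than the mean of the historical runs: the kind of per-commit check a
// continuous benchmarking page can apply automatically.
bool IsRegression(const std::vector<double> &history, double latest,
                  double tolerance = 1.5) {
    if (history.empty()) return false;
    double mean = std::accumulate(history.begin(), history.end(), 0.0) /
                  (double)history.size();
    return latest > mean * tolerance;
}
```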
We would probably not have caught this otherwise; you don't catch these things in unit tests as much, because the answer is still correct.

All right, kind of the last point: be careful with dependencies. This is something we hold close to heart, partly because we are forced to by our work environment. Dependencies are very easy for the developer, but they can be very difficult for the user: they might be running a different operating system, they might not have a package manager, or they might not have root. That is actually our scenario on our work machines: we don't have root, which means we can't use the package manager, which really sucks when you want to install things that have a lot of dependencies; otherwise it's fine. And of course packages can break; we all know about the left-pad fiasco. If you have a dependency chain, everything can break down if you have one bad package in there. So dependencies are a liability, and you should use them with caution. Don't include a dependency on Boost just to parse one number from a string, please. We actually found this in a Python Parquet reader: it had something like 20 dependencies, but this one was the most ridiculous, it was parsing one number and it pulled in the whole of Boost for that. Ridiculous. And here's a nice quote from the tinyverse people, who are kind of on a mission to have fewer dependencies: every dependency you add to your project is an invitation to break your project. So yeah, like I said, we can't use root on our work machines, and since I'm complaining anyway: I actually tried to use Terrier there, and I got met with this message, so I was not so happy; I couldn't use it without first compiling GMP.
So maybe for you it would also help me a lot if you guys were to not use dependencies.

Okay, hard question: do you guys use GMP?

No, we don't have it right now. We probably will add it, but if we add it we might even inline it.

We find it gives us like a 20 percent boost; that's why we have it. So you say you have no dependencies, right?

Not a lot, but we have inlined dependencies. We use the Postgres parser, for example.

So you've copied the code you need into a third-party directory, and you compile it in. How do you decide? It's just you and Hannes; how do you decide "we need this feature, we can roll our own, or we can bring in this dependency"? Do you have a way of checking that, or is it ad hoc?

It's mostly what we find to be a reasonable use of our time, and one criterion is also how easy it is to inline in the first place. We are making these amalgamation builds, which means, well, first off everything has to compile as C++, so we have actually spent some time converting C dependencies, like the Postgres parser, the actual grammar and the yacc stuff, to C++. All of this has to stay simple for us, and the problem is that if you want to include one library, you have to include all of its dependencies as well. So it very much depends on the cost-benefit trade-off, and we make that ad hoc, I would say. The ones we use are the Postgres parser, because we don't want to write our own parser; utf8proc, because we don't want to write our own Unicode framework, that is a lot of effort and annoying; and I think we also have Snappy in there somewhere, because why would we reinvent Snappy? Or "Snuppy", as I apparently keep pronouncing it. Okay, yes, so those are all the things I wanted to talk about.
So we're not going to do applause out loud; you hit the applause button. I know it's an empty gesture, but you want to do that to thank Mark. And then we'll go around, and if anybody wants to ask questions, you can raise your hand and we'll unmute you.

Okay, questions. I was going to say: you didn't talk about how you, well, you're not built on top of SQLite, but you implement the SQLite API, right? So when you say it's an embedded database, is your code sending SQL over JDBC, or...?

No, there's no JDBC; it's really just like SQLite. We have a function called query that takes a SQL string, and that feeds directly into our engine; there is nothing else. We do also implement the SQLite API, but that is just: we ship the SQLite header with our own implementation behind it, which is DuckDB underneath. There's no client, and we don't really have any wire protocol of any kind. We have some REST server that we use for one thing, but it's not a serious focus.

Okay. And then for the fast and slow tests: on the command line you call make unittest or whatever, and that runs the fast stuff, and then at night, or when you push to GitHub, you run the slow stuff? How does that work?

It very much depends on what you want to do. If you want a basic verification that your system is not broken, you just run the fast ones. If you want to debug a test, you obviously run just that specific one; you don't run everything. Before you push to GitHub, you always run all of them, including the slow ones; they take maybe 15 to 30 minutes, depending on your machine.

So on Travis, in the build, what are you running, the slow stuff?

Yes. We actually have a pretty complex Travis setup. We run a bunch of different compilers, of course; we build
our packages, the Python packages, on Linux, OS X, and Windows. We run GCC from version 5, I think that's the lowest, up to the newest one, and we run different versions of Clang. We also have some extra verification builds. For example, we have this vectorized engine, and we have one build where we run with a vector size of two: our normal vector size is 1024, but running with a vector size of two actually detects some other bugs. Then we also have different compilation options: we compile the amalgamation, we do our normal compile, and we test all of those in Travis. I think we have like 17 or 18 Travis jobs. We also actually need Solaris to work, because there are a bunch of old people that like Solaris for some reason.
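The vector-size-of-two trick works because a tiny chunk size forces every operator across many chunk boundaries, which is where off-by-one bugs hide. A toy sketch of the idea, not DuckDB code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Process the input in fixed-size chunks, the way a vectorized engine does.
// Running the whole test suite with chunk_size == 2 pushes every operator
// through many chunk boundaries instead of a single big pass.
long ChunkedSum(const std::vector<int> &data, std::size_t chunk_size) {
    long total = 0;
    for (std::size_t start = 0; start < data.size(); start += chunk_size) {
        std::size_t end = std::min(start + chunk_size, data.size());
        for (std::size_t i = start; i < end; i++) total += data[i];
    }
    return total;
}
```

Running the same suite at chunk size 2 and 1024 and demanding identical answers flushes out boundary bugs cheaply.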
Okay. All right, any other questions from anybody else?

I have a question: you said you rewrote the binder, and...

Wait, you should introduce yourself.

Hi, I am someone who is currently maybe going to rewrite a binder in the near future.

Oh, very good. He's an undergrad here; he's starting his PhD with me in the fall, so you will come across Mark, and Mark will come across him, in the future, for sure, if we still have conferences by then.

So I was kind of wondering what your main takeaways were for how you should go about writing a binder so that you don't have to rewrite it again, and also whether you're annotating existing objects, or figuring out what information needs to be generated and storing that on the side.
It's a bit tricky, I think. One thing that was kind of crucial for us was subquery binding, which is quite tricky. We started with, I think, something very similar to your binder, which gave us a lot of problems with binding correlated subqueries. So we added this nesting of binders, where you have different nested levels of binders, but also on the expression level, so the expression binders have a layering as well. For example, you can have a subquery in your SELECT clause that can bind to the variables that are available in the SELECT clause; but within that you can have another subquery with a WHERE clause, and then another subquery inside that WHERE clause, which needs to bind in that WHERE clause again, and you can bind at all of these different levels. So you have this layering of different expression binders depending on where you are, because if you are in, say, a SELECT clause you can bind aggregates, but if you're in a WHERE clause you can't, so you have that context there. I think that's the way we structured our binder now.

One other thing that's important, at least for us, is that we want to do statistics propagation throughout the binder as well, because we do overflow prevention using those statistics, so you have to be able to track those statistics throughout the plan.
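Going back to the nested-binder layering: a toy sketch of the idea, with purely illustrative names, might chain each clause's expression binder to its parent and give each one different rules:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy expression binder: each clause gets its own binder with different
// rules (a WHERE binder may not bind aggregates, a SELECT binder may),
// and a subquery's binder chains to its parent so correlated column
// references can still resolve. Structure is illustrative only.
struct ExprBinder {
    std::set<std::string> columns;  // names visible at this level
    bool allow_aggregates;
    const ExprBinder *parent;       // enclosing query's binder, if any

    ExprBinder(std::set<std::string> cols, bool aggs,
               const ExprBinder *up = nullptr)
        : columns(std::move(cols)), allow_aggregates(aggs), parent(up) {}

    bool BindColumn(const std::string &name) const {
        if (columns.count(name)) return true;
        // Fall back to the enclosing binder: a correlated reference.
        return parent && parent->BindColumn(name);
    }
    bool BindAggregate() const { return allow_aggregates; }
};
```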
You don't want to just give up your statistics when you leave the SELECT: you want to go from "okay, I have my statistics from the table", through the filter, through the WHERE, to the SELECT, and propagate all of those statistics along.

Your binder is doing that?

Actually, not anymore right now. We did it, but I removed it in one of the reworks, and we need to re-add it. The reason we did it, and want to do it again, is that we need this for overflow prevention: we want to guarantee statically that an expression cannot overflow, and if it can't, then we can use a fast path in the computation.

So by statistics, do you mean things like histograms and cardinalities of the columns?

Min and max, whether a string column has Unicode, things like that, because for string operations there's usually a fast path. Histograms and cardinality estimation are done somewhere else; we don't actually have a very good cardinality estimator right now, it's one of the features that's kind of missing. That's done in the join order optimizer, not in the binder.
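The overflow-prevention idea, deciding statically from min/max statistics whether an unchecked fast path is safe, can be sketched like this (types and names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Toy per-column statistics, as tracked through the plan.
struct Stats { int64_t min; int64_t max; };

// Statically decide whether a + b can ever overflow int32, given the
// columns' min/max statistics. If it can't, an engine can take a fast
// path with plain 32-bit adds instead of checked or widened arithmetic.
bool AdditionCannotOverflowInt32(const Stats &a, const Stats &b) {
    int64_t lo = a.min + b.min;   // safe: stats are held as 64-bit here
    int64_t hi = a.max + b.max;
    return lo >= std::numeric_limits<int32_t>::min() &&
           hi <= std::numeric_limits<int32_t>::max();
}
```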
A quick question, actually, going back to what you said earlier: when you said you started with something similar to what we do, did you mean Terrier?

No, I meant Peloton. But I guess you've got some code similar to that.

No, we rewrote the binder; well, we wrote it twice, so third time's the charm.

Exactly, yes.

Do you use the binder to do validation? DuckDB also supports prepared statements; how does the prepared statement code path differ from the binder code path when it comes to, for example, validating types?

So, the prepared statements are actually one of the things Hannes implemented, but I do know something about them. A prepared statement parameter needs to have a type, right, and in the binder we do all the type resolution. This is one thing we had to add as well, for the function interface that we have, where you have this
propagation of all the SQL types. I don't know exactly how it works for you guys, but we have a separate type internally and in SQL, and the reason is that we actually use the same internal type for different SQL types. For example, a date is just a 32-bit integer, which means we have the SQL type DATE and the internal type integer, and all of this SQL type information only exists during the binding phase; after that, everything is converted to the internal types, and the binder does all of this type resolution.

And I kind of forgot your original question... whether the type resolution also has to happen for prepared statements? Yeah. For prepared statements, what happens is that in the binder the parameter has an unknown type, and that unknown type propagates down, and when it encounters something that gives it a type, we assign that type to the parameter. That can be, for example, a function: if you use a prepared statement parameter as the first argument of substring, that argument has to be a VARCHAR, right, there's no other possibility, so we add a cast to VARCHAR, the parameter becomes a VARCHAR parameter, and you can then only bind a VARCHAR value to that parameter.
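The parameter-typing flow described here, unknown until the first typed context fixes it, could be sketched as follows (hypothetical names, not DuckDB's classes):

```cpp
#include <cassert>

enum class SqlType { Unknown, Integer, Varchar };

// Toy parameter binding: a prepared-statement parameter starts out with
// an unknown type; the first typed context it appears in (for example, a
// function argument slot) fixes its type, after which conflicting uses
// are rejected.
struct Parameter {
    SqlType type = SqlType::Unknown;

    // Called when the binder sees the parameter used where `required` is
    // expected, e.g. as substring()'s first argument (VARCHAR).
    bool BindAs(SqlType required) {
        if (type == SqlType::Unknown) { type = required; return true; }
        return type == required;
    }
};
```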
Yep, that makes sense. And since DuckDB is MIT licensed, you can just look at the binder.

Yeah, and if you have any questions, shoot me an email, or we can have a Zoom chat; I'm sitting at home all day anyway, I don't mind answering questions.

Okay, thank you. Just a very high-level question I have out of curiosity: of all the components of DuckDB, the storage, execution, transactions, the concurrency control code, the binding and catalog, which part, if you can give a rough estimate, did you spend the most time on, or find the most difficult to deal with, and why?

This is difficult to say, because we have not given each component the same focus. For me personally, the most difficult component was the storage,
and actually the storage is the least polished part of the system, I think. The reason it was so difficult is that our group has not actually done that much in storage. In MonetDB, which is one of the other databases we have made, they just mmap some arrays, and that's it. Vectorwise has a very complex storage layer that was written by basically one guy who left 15 years ago; I have not personally worked with the storage there, but from what I heard it's kind of a mess. Basically, as a group we focus a lot on analytical queries and query execution, those sorts of things, so those came very naturally to me: query execution was quite easy to do, which is also why I think the execution framework is probably our best component. For the storage, I had to relearn how to write a buffer pool and so on.

I see, thanks for sharing.
Hi, this is Paulina; I'm a master's student in the computer science program. You said that you have implemented compressed execution, and that you're also working on compressed storage, and I was wondering if you could tell us a bit about that.

Yeah. The way we have implemented compressed execution is that the vectors that propagate through the system have a type, as in an integer type or a string type, but they also have a vector kind, which is the actual physical representation, and the physical representation can be anything: it can be dictionary compressed, it can be a flat array, it can be a single constant as well, where a single constant is logically replicated but physically stored once. These vectors flow through the system in whatever representation you want, and whenever you get to an operator, the operator can decide what to do with them. We have a method that we call Orrify, because it was come up with by Orri Erling from Facebook: it basically creates a thin wrapper over a vector that gives you an iterator over it that is quite efficient for the common cases. Then we have some specialized code to handle different types of compression, but it's very modular: for any function or operator where you want compressed execution, you can just add a specific type of compressed execution to it. So we don't have everything yet, and we probably don't even want everything.
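A toy sketch of vectors with one logical type but several physical layouts, plus a unified accessor in the spirit of the Orrify wrapper (all names and layouts here are illustrative simplifications of what the talk describes):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy vector: one logical type, several physical layouts. A constant or
// dictionary vector shares its backing buffer via shared_ptr, so the data
// stays alive as long as any vector references it.
enum class VectorKind { Flat, Constant, Dictionary };

struct IntVector {
    VectorKind kind;
    std::shared_ptr<std::vector<int>> data;  // values, or the dictionary
    std::vector<int> sel;                    // dictionary indices
    int count;

    // A unified accessor in the spirit of "orrification": operators that
    // do not special-case a layout can still read any vector uniformly.
    int Get(int i) const {
        switch (kind) {
        case VectorKind::Flat:     return (*data)[i];
        case VectorKind::Constant: return (*data)[0];
        default:                   return (*data)[sel[i]];
        }
    }
};

// An operator written once against Get() works on every layout.
long Sum(const IntVector &v) {
    long total = 0;
    for (int i = 0; i < v.count; i++) total += v.Get(i);
    return total;
}
```

Because the backing buffer is a `shared_ptr`, a dictionary referenced by several vectors stays alive as long as any of them does, which matches the ownership scheme discussed next.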
In the case of dictionary compression, are you passing the dictionary along with the vector?

Yeah. The vectors in our system have shared pointers to a buffer, basically, and the buffer can hold anything; it's not just a bunch of bytes, it can also hold a dictionary, it can hold a lock as well, so we can have a lock on something there, or a pinned buffer pool page. It's all shared pointers in the vectors, and that basically means we can have a dictionary that's referenced by a vector, and as long as that vector is alive, the dictionary will also be alive. The dictionary doesn't have to physically travel along with the vector; it can just be a pointer, but it is kept alive as long as the vector is.

Okay. And when you say you're working on compressed storage, what are you thinking about?

Well, there's a bunch of different things. From our group
we have PFOR-delta, which will probably go in there for sure, and Peter actually came up with a new string compression method, which will also go in for sure. Besides that, probably some simple things like RLE or dictionary encoding. For now all our storage is just flat, but we are adding at least the simple things: dictionary, RLE, PFOR-delta, and maybe that not-so-simple string compression. We're not going to use generic compression; we're not going to use Snappy or anything like that, for example.

"Snuppy"?

Snappy. Sorry, Snappy, thank you.

Very good. Usually I'm the person correcting people's pronunciation at our office. So, what do your catalogs look like? Are they tables, or are they amazing C++ data structures?

Our catalog is just a bunch of unordered maps,
Maps but they are transaction managed so we actually have a separate code path for our catalog but it is fully managed by the MCC because we really do want to have things like create table statements being tied to transactions of course but the main reason we went with this like sort of different representation is because we had a lot of problems in Minetta B with this because in Manatee they use the main storage engine to also
handle the catalog but the storage engine in Minetta B is very optimist which means if you so much look if you someone just look at the same table from two different threads the whole system just blows up like it's it starts like undoing things like saying like all transaction aborted there was a potential conflict you know and that got really problematic when you wanted to handle the catalog because if two threads would start transaction create a table they would already have a conflict
so we kind of took away from that that if you're doing optimistic concurrency control maybe you don't want to do that from the cowl so mmm okay we are in are in the catalog and they are transactional or Sarah catalogs are in tables than they are transactional so it sounds like it was sort of an engineering limitation of money to be rather than some more fundamental no they could be stored as tables as well
it's just that's Rennie what I'm saying but whether the issue of like we described in a dvd-roms it sounds like it was an engineering problem not a fundable it's I think either way it's kind of fine there is some duplicate code that we have in the catalog that deals with the transaction stuff yeah because the handles is kind of in a different way I think if you're making a system that's pretty good with like points updates like that it's not super optimistic its
like that it's not super optimistic its finest stores tables as well I think there's no fundamental objection to that but I think for like a optimistic engine like we ran into these problems and for us it's also kind of easy to just have like a nun word map so you know like it's it's kind of simple in a way as well uh it's easier to pretty strapped because it's no chicken for the egg problem um yeah all right I have two more questions and then we should let you go because it's it's
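The transaction-managed catalog Mark describes can be sketched as an unordered map whose entries carry transaction visibility information. This is a greatly simplified illustration with made-up names: no deletes, no version chains:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Every entry records the id of the transaction that created it, so a
// lookup only sees tables committed at or before the reader's snapshot.
struct CatalogEntry { uint64_t created_by; };

struct Catalog {
    std::unordered_map<std::string, CatalogEntry> tables;

    void CreateTable(const std::string &name, uint64_t txn_id) {
        tables[name] = CatalogEntry{txn_id};
    }
    bool Visible(const std::string &name, uint64_t reader_txn) const {
        auto it = tables.find(name);
        return it != tables.end() && it->second.created_by <= reader_txn;
    }
};
```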
So, you mentioned that you're using shared pointers; have you found that to ever be a performance issue, or do you just use unique pointers almost everywhere?

The only place where we use shared pointers is in the vector buffers. I thought about it a lot, and we ended up doing it because I saw no other good way to implement this. The reason is: if you have a vector that has, say, a dictionary, we have operations that just reference a vector, and you have to have some sort of shared ownership there, or you transfer the ownership, but sometimes that's a bit iffy. For example, a CASE statement gets two or more vectors as input; if you have strings, you don't want to copy all of the strings, you just want to keep pointers to them. So the CASE output relies on that string area, and the original vector also relies on the string area: it is inherently a shared thing, because you can have multiple vectors depending on the same data. For the rest of the system we use unique pointers everywhere, and I really think that was the right choice as well.

In Peloton we just... it was a wreck. We're doing something different in our new system: we don't use shared pointers, we have our own pointer type, and we manage memory separately.

My last question: it's an embedded system, so it's single-threaded, right? Or how does that work?

Right now it's single-threaded.

Sorry, because you're running in Python, you just can't launch a thread into it?

No, no, you can launch any threads you want; there's nothing stopping you from
launching threads from the C code. In Python, only when you want to interface with Python objects do you have to respect that; you can easily run the database engine in parallel, which we also will do. So DuckDB is currently single-threaded per query, but we have full support for concurrent queries and such: you can run updaters and inserters and readers, however many you want, and most of them are not going to interfere with each other, unless you do something like single-row updates in point-heavy workloads, then you will have a lot of conflicts, but that's also not what we care about.

What about background threads? If it's an embedded system, do you only have one thread, or can people do whatever they want?

Yeah, so basically people can do whatever they want with the system. That's kind of a challenge,
because people can mess with it, of course. Once we add multi-threading, we will allow the user to limit it as well, because they might want to: you don't want to use all eight cores on someone's laptop for no reason; maybe the database is not that important to them.

Do you have checkpoints? Like, does a background thread take a checkpoint?

No. This is one thing we decided on as well: we don't want any background things. The checkpoints, which are actually still pretty ugly, just trigger once the database reaches a certain size, and they are then just part of whatever statement made the database exceed that threshold. We made the call that there should not be any background anything.

No background GC?

No, no GC. Everything runs
just inside the query: if you call duckdb query or whatever, everything happens in there; that's all DuckDB does. There's no background anything; it's kind of a philosophy.

Just curious, can you elaborate a little bit more on the decision to not have any background threads?
Yeah. So, I actually have some extra slides on this... okay, I'm never going to find it... yeah, resource sharing, I guess it's this slide. You can still see the slides? Okay. So, resource sharing is kind of a big thing for us, and not for standalone servers, because if you run a standalone database server, the machine is yours, basically; you take up the whole computer. But we don't do that: we run on someone's laptop, probably, inside another process, so we really want to respect that boundary somehow and not just go wild. You don't want the Python process that you are typing interactive queries into to just randomly hang for a few seconds because your database decides it needs to checkpoint, or because your database decides "aha, you haven't done anything inside the database for a while, I'm going to checkpoint", which makes a lot of sense for a standalone server, but not so much otherwise. We have been looking a bit into doing resource sharing adaptively as well, but that's pretty tricky; it's probably not something we're really going to look into, but we have considered it.

Okay.
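The no-background-threads checkpoint policy from a few answers back can be sketched like this: the write that crosses the size threshold pays for the checkpoint inline, rather than a daemon waking up later. Threshold and data structures are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy write path with no background threads: the statement that pushes
// the database past the size threshold runs the "checkpoint" itself,
// inside its own call, instead of deferring to a background worker.
struct InlineCheckpointDB {
    std::vector<int> wal;          // pending changes
    int checkpoints_taken = 0;
    static const std::size_t threshold = 4;

    void Write(int value) {
        wal.push_back(value);
        if (wal.size() >= threshold) {   // crossed the size threshold:
            wal.clear();                 // "checkpoint" runs inside Write()
            checkpoints_taken++;
        }
    }
};
```

The host process never hangs at an arbitrary moment; checkpoint cost is always attributed to the statement that triggered it.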